GroupNormFusion

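Performs fused group normalization (Group Normalization Fusion) on the input tensor. In a multi-core environment the work is split by batch, and the mean, variance, and normalization computations run in parallel.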

\[\hat{x}_{(b,u,c)} = \frac{x_{(b,u,c)} - \mu_{(b,g)}}{\sqrt{\sigma^2_{(b,g)} + \epsilon}}, \quad y_{(b,u,c)} = \hat{x}_{(b,u,c)} \cdot \mathrm{scale}_c + \mathrm{offset}_c\]

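Here \((b,u,c)\) index the batch, spatial position, and channel, and \(g\) is the group that channel \(c\) belongs to. \(\mu_{(b,g)}\) and \(\sigma^2_{(b,g)}\) are the mean and variance of group \(g\) in batch \(b\), and \(\epsilon\) is the numerical-stability term.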

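Inputs:

input - Base address of the input tensor, shape [batch, unit, channel].

scale - Base address of the per-channel scale coefficients, length channel.

offset - Base address of the per-channel offset coefficients, length channel.

mean - Base address of the per-batch, per-group mean buffer, length batch * num_groups.

variance - Base address of the per-batch, per-group variance buffer, length batch * num_groups.

epsilon - Numerical-stability term added to the variance before the square root.

num_groups - Number of groups.

channel - Total number of channels.

unit - Number of normalization units within each batch (H×W).

batch - Number of batches.

core_mask (int, optional) - Core mask (shared-memory version only).

Outputs:

output - Base address of the tensor that receives the group-normalization result.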
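The buffer lengths listed above can be collected in one place before allocating or placing buffers. The helper below is a sketch: gn_sizes and gn_buffer_sizes are hypothetical names, and it assumes that output has the same element count as input, which this document implies but does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Element counts implied by the parameter list (hypothetical helper). */
typedef struct {
    size_t tensor;  /* input, and by assumption output: batch * unit * channel */
    size_t params;  /* scale and offset: channel */
    size_t stats;   /* mean and variance: batch * num_groups */
} gn_sizes;

static gn_sizes gn_buffer_sizes(int batch, int unit, int channel,
                                int num_groups)
{
    assert(batch > 0 && unit > 0 && channel > 0 && num_groups > 0);
    gn_sizes s;
    s.tensor = (size_t)batch * (size_t)unit * (size_t)channel;
    s.params = (size_t)channel;
    s.stats  = (size_t)batch * (size_t)num_groups;
    return s;
}
```

Multiply each count by sizeof(float) for the fp32 entry points, or by the size of the half type for the fp16 ones, to get byte sizes.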

Supported platforms:
FT78NE MT7004

Remarks

FT78NE supports the fp32 data type.

MT7004 supports the fp16 and fp32 data types.

Shared-memory version:

void hp_groupnormfusion_s(const half *input, const half *scale, const half *offset, half *mean, half *variance, float epsilon, int num_groups, int channel, int unit, int batch, int core_mask, half *output)

void fp_groupnormfusion_s(const float *input, const float *scale, const float *offset, float *mean, float *variance, float epsilon, int num_groups, int channel, int unit, int batch, int core_mask, float *output)
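The operator is described as splitting work by batch across cores, but the exact scheduling is not documented here. As a rough illustration only, one common way to divide batch items evenly among a number of cores is sketched below; batch_range and split_batches are hypothetical names, and the library may partition differently.

```c
#include <assert.h>

/* Half-open batch range [begin, end) handled by one core. */
typedef struct {
    int begin;
    int end;
} batch_range;

/* Even split: the first (batch % num_cores) cores take one extra batch.
 * Hypothetical helper; not the library's documented scheduling. */
static batch_range split_batches(int batch, int num_cores, int core_idx)
{
    assert(batch >= 0 && num_cores > 0);
    assert(core_idx >= 0 && core_idx < num_cores);
    const int base = batch / num_cores;
    const int rem  = batch % num_cores;
    batch_range r;
    r.begin = core_idx * base + (core_idx < rem ? core_idx : rem);
    r.end   = r.begin + base + (core_idx < rem ? 1 : 0);
    return r;
}
```

Because mean and variance are computed per (batch, group), a batch-level split like this lets each core work on its own batches without exchanging partial sums with other cores.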

C usage example:

// FT78NE multi-core example
// Include the library header that declares fp_groupnormfusion_s
// (its name is not given in this document).

int main(void) {
    // Buffer addresses are platform-specific placeholders.
    const float *input = (const float *)0xA0000000;     // DDR memory
    const float *scale = (const float *)0xB0000000;
    const float *offset = (const float *)0xB0001000;
    float *mean = (float *)0xB0002000;                  // batch * num_groups floats
    float *variance = (float *)0xB0003000;              // batch * num_groups floats
    float *output = (float *)0xC0000000;
    int num_groups = 8;
    int channel = 64;
    int unit = 49;                                      // e.g. H x W = 7 x 7
    int batch = 32;
    float epsilon = 1e-5f;
    int core_mask = 0xff;                               // core mask value
    fp_groupnormfusion_s(input, scale, offset, mean, variance,
                         epsilon, num_groups, channel, unit,
                         batch, core_mask, output);
    return 0;
}
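The example passes core_mask = 0xff. This document does not define the bit layout of core_mask. The sketch below assumes the common convention that bit i selects core i, so 0xff would select cores 0 through 7; that reading is an assumption, and first_n_cores_mask and count_cores are hypothetical helpers.

```c
#include <assert.h>

/* Mask with the lowest n bits set; n = 8 yields 0xff.
 * Assumes bit i of the mask selects core i (not confirmed by the source). */
static int first_n_cores_mask(int n)
{
    assert(n >= 0 && n < 31);
    return (int)((1u << n) - 1u);
}

/* Number of cores selected by a mask under the same assumption. */
static int count_cores(int mask)
{
    unsigned m = (unsigned)mask;
    int n = 0;
    while (m != 0u) {
        m &= m - 1u;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}
```

Check the platform documentation for the actual meaning of each bit before relying on this convention.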

Private-memory version:

void hp_groupnormfusion_p(const half *input, const half *scale, const half *offset, half *mean, half *variance, float epsilon, int num_groups, int channel, int unit, int batch, half *output)

void fp_groupnormfusion_p(const float *input, const float *scale, const float *offset, float *mean, float *variance, float epsilon, int num_groups, int channel, int unit, int batch, float *output)

C usage example:

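// MT7004 single-core example
// Include the library header that declares hp_groupnormfusion_p
// (its name is not given in this document).

int main(void) {
    // Buffer addresses are platform-specific placeholders.
    // input holds batch * unit * channel = 18432 halves = 0x9000 bytes,
    // so scale and offset are placed after 0x10009000 to avoid overlapping it.
    const half *input = (const half *)0x10000000;       // L2 memory
    const half *scale = (const half *)0x1000A000;
    const half *offset = (const half *)0x1000B000;
    half *mean = (half *)0x1000C000;                    // batch * num_groups halves
    half *variance = (half *)0x10010000;                // batch * num_groups halves
    half *output = (half *)0x10014000;
    int num_groups = 4;
    int channel = 32;
    int unit = 36;                                      // e.g. H x W = 6 x 6
    int batch = 16;
    float epsilon = 1e-4f;
    hp_groupnormfusion_p(input, scale, offset, mean, variance,
                         epsilon, num_groups, channel, unit,
                         batch, output);
    return 0;
}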
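When buffers are placed at hard-coded addresses as in these examples, each region has to be large enough for its buffer and must not run into the next one. For batch = 16, unit = 36, channel = 32 in fp16, the input alone occupies 16 * 36 * 32 * 2 = 36864 bytes (0x9000). The following sketch shows one way to check a layout on the host; tensor_bytes and regions_overlap are hypothetical helpers, and the check assumes the address arithmetic does not wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a [batch, unit, channel] tensor of elem_size-byte elements. */
static size_t tensor_bytes(int batch, int unit, int channel, size_t elem_size)
{
    assert(batch > 0 && unit > 0 && channel > 0 && elem_size > 0);
    return (size_t)batch * (size_t)unit * (size_t)channel * elem_size;
}

/* Returns 1 if [a, a + a_bytes) and [b, b + b_bytes) share any byte.
 * Assumes neither end address wraps around. */
static int regions_overlap(uintptr_t a, size_t a_bytes,
                           uintptr_t b, size_t b_bytes)
{
    return a < b + b_bytes && b < a + a_bytes;
}
```

For instance, a 64-byte scale buffer placed 0x4000 bytes after the start of that 0x9000-byte input would overlap it, while one placed 0xA000 bytes after the start would not.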